Distributing files across multiple, permissibly heterogeneous, storage devices

ABSTRACT

A file system (i) permits storage capacity to be added easily, (ii) can be expanded beyond a given unit, (iii) is easy to administer and manage, (iv) permits data sharing, and (v) is able to perform effectively with very large storage capacity and client loads. State information from a newly added unit is communicated (e.g., automatically and transparently) to central administration and management operations. Configuration and control information from such operations is communicated (e.g., automatically) back down to the newly added units, as well as existing units. In this way, a file system can span both local storage devices (like disk drives) and networked computational devices transparently to clients. Such state and configuration and control information can include globally managed segments as the building blocks of the file system, and a fixed mapping of globally unique file identifiers (e.g., Inode numbers) and/or ranges thereof, to such segments.

§ 0. RELATED APPLICATIONS

Benefit is claimed, under 35 U.S.C. § 119(e)(1), to the filing date of provisional patent application Ser. No. ______ (Attorney Docket No. 58300-012), entitled “ENTERPRISE DATA STORAGE SYSTEM AND METHOD”, filed on Sep. 12, 2000 and listing David Chrin, Steven Orszag, Michael Brennan and Philip Eric Jackson as inventors, for any inventions disclosed in the manner provided by 35 U.S.C. § 112, ¶ 1. This provisional application is expressly incorporated herein by reference.

§ 1. BACKGROUND OF THE INVENTION

§ 1.1 Field of the Invention

The present invention concerns computer storage and file systems. More specifically, the present invention concerns techniques for managing and using a distributed storage system.

§ 1.2 Related Art

Data generated by, and for use by, computers is stored in file systems. The design of file systems has evolved in the last two decades, basically from a server-centric model (which can be thought of as a local file system), to a storage-centric model (which can be thought of as a networked file system).

Stand-alone personal computers exemplify a server-centric model: storage has resided on the personal computer itself, initially using hard disk storage, and more recently, optical storage. As local area networks (“LANs”) became popular, networked computers could store and share data on a so-called file server on the LAN. Storage associated with a given file server is commonly referred to as server attached storage (“SAS”). Storage could be increased by adding disk space to a file server. Unfortunately, however, SASs are only expandable internally; there is no transparent data sharing between file servers. Further, with SASs, throughput is limited by the speed of the fixed number of buses internal to the file server. Accordingly, SASs also exemplify a server-centric model.

As networks became more common, and as network speed and reliability increased, network attached storage (“NAS”) has become popular. NASs are easy to install and each NAS, individually, is relatively easy to maintain. In a NAS, a file system on the server is accessible from a client via a network file system protocol like NFS or CIFS. Network file systems like NFS and CIFS are layered protocols that allow a client to request a particular file from a pre-designated server. The client's operating system translates a file access request to the NFS or DFS format and forwards it to the server. The server processes the request and in turn translates it to a local file system call that accesses the information on magnetic disks or other storage media. The disadvantage of this technology is that a file system cannot expand beyond the limits of a single NAS machine. Consequently, administering and maintaining more than a few NAS units, and consequently more than a few file systems, becomes difficult. Thus, in this regard, NASs can be thought of as a server-centric file system model.

Storage area networks (SANs) (and clustered file systems) exemplify a storage-centric file system model. SANs provide a simple technology for managing a cluster or group of disk-storage units, effectively pooling such units. SANs use a front-end system, which can be a NAS or a traditional server. SANs (i) are easy to expand, (ii) permit centralized management and administration of the pool of disk storage units, and (iii) allow the pool of disk storage units to be shared among a set of front-end server systems. Moreover, SANs enable various data protection/availability functions such as multi-unit mirroring with failover for example. Unfortunately, however, SANs are expensive. Although they permit space to be shared among front-end server systems, they don't permit multiple SAN environments to use the same file system. Thus, although SANs pool storage, they basically behave as a server-centric file system. That is, a SAN behaves like a fancy (e.g., with advanced data protection and availability functions) disk drive on a system. Finally, various incompatible versions of SANs have emerged.

The article, T. E. Anderson et al., “Serverless Network File Systems,” Proc. 15th ACM Symposium on Operating Systems Principles, pp. 109-126 (1995) (hereafter referred to as “the Berkeley paper”) discusses a data-centric distributed file system. In the system, manager maps, which map a file to a manager for controlling the file, are globally managed and maintained. Unfortunately, the present inventors believe that maintaining and storing a map having every file could limit scalability of the system as the number of files becomes large.

§ 1.3 Unmet Needs

In view of the foregoing disadvantages of known storage technologies, such as the server-centric and storage-centric models described above, there is a need for a new storage technology that (i) permits storage capacity to be added easily (as is the case with NASs), (ii) permits file systems to be expanded beyond a given unit (as is the case with SANs), (iii) is easy to administer and manage, (iv) permits data sharing, and (v) is able to perform effectively with very large storage capacity and client loads.

§ 2. SUMMARY OF THE INVENTION

The present invention may provide methods, apparatus and data structures for providing a file system which meets the needs listed in § 1.3. A distributed file system in which files are distributed across more than one file server and in which each file server has physical storage media may be provided. The present invention can determine a particular file server to which a file system call pertains by (a) accepting a file system call including a file identifier, (b) determining a contiguous unit of the physical storage media of the file servers of the distributed file system based on the file identifier, (c) determining the file server having the physical storage media that contains the determined contiguous unit, and (d) forwarding a request, based on the file system call accepted, to the file server determined to have the physical storage media that contains the determined contiguous unit.

The file identifier may be an Inode number and the contiguous unit may be a segment. The file server having the physical storage media that contains the determined contiguous unit may be determined by a table, administered globally across the file system, that maps the contiguous unit to (the (e.g., IP) address of) the file server.

§ 3. BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an exemplary environment in which various aspects of the present invention may take place.

FIG. 2 is a process bubble diagram of operations that may be carried out by various exemplary apparatus used in the environment of FIG. 1.

FIG. 3 is a block diagram of an exemplary data structure of a storage medium, such as a disk-based storage medium.

FIG. 4 is a block diagram of an exemplary table data structure that may be used to map segment numbers to an identifier (e.g., an address) of a file server storing the segment.

FIG. 5 is a flow diagram of an exemplary method that may be used to effect a file system call translation operation.

FIG. 6 is a flow diagram of an exemplary method that may be used to effect a transaction routing operation.

FIG. 7 is a flow diagram of an exemplary method that may be used to effect a network interface operation.

FIG. 8 is a flow diagram of an exemplary method that may be used to effect local file operations.

FIG. 9 is a block diagram of apparatus on which various operations of the present invention may be effected, and on which various data structures and files may be stored.

FIG. 10 is a messaging diagram that illustrates a read operation in an exemplary embodiment of the present invention.

FIG. 11 is a messaging diagram that illustrates a write operation in an exemplary embodiment of the present invention.

§ 4. DETAILED DESCRIPTION

The present invention involves novel methods, apparatus and data structures for providing advanced data storage. The following description is presented to enable one skilled in the art to make and use the invention, and is provided in the context of particular applications and their requirements. Various modifications to the disclosed embodiments will be apparent to those skilled in the art, and the general principles set forth below may be applied to other embodiments and applications. Thus, the present invention is not intended to be limited to the embodiments shown and the inventors regard their invention as the following disclosed methods, apparatus, articles of manufacture, and data structures and any other patentable subject matter to the extent that they are patentable.

In the following, environments in which the present invention may be employed are introduced in § 4.1. Then, functions that may be performed by the present invention are introduced in § 4.2. Then, operations, data structures, methods and apparatus that may be used to effect those functions are described in § 4.3. Thereafter, examples of how exemplary parts of the present invention may operate are described in § 4.4. Finally, some conclusions about the present invention are set forth in § 4.5.

§ 4.1 Exemplary Environments in which Invention May Operate

The following exemplary environments are presented to illustrate examples of utility of the present invention and to illustrate examples of contexts in which the present invention may operate. However, the present invention can be used in other environments and its use is not intended to be limited to the exemplary environments.

FIG. 1 is a block diagram of an environment 100 in which the present invention may be used. Various components are coupled with (i.e., can communicate with) a network(s) 110, such as an Internet protocol (“IP”) based network. A file system 120(1), 120(2) may include one or more file servers 122. One or more portal units 130 permit one or more clients 140 to use the file system(s). The clients 140 needn't be provided with any special front-end software or application. From the perspective of a client 140, the file system 120(1), 120(2) is a virtual single storage device residing on the portal. Combined file server and portal units 150 are possible. Administration 160 of the file servers and portals may be centralized. Administrative information may be collected from the units 122, 130, 150 and distributed to such units 122, 130, 150 in a point-to-point or hierarchical manner. As shown, the environment 100 can support multiple file systems 120(1), 120(2) if desired. As illustrated, a single file server 122 b may belong to/support more than one file system.

§ 4.2 Functions that May be Performed by the Present Invention

The present invention may function to (i) permit storage capacity to be added easily (as is the case with NASs), (ii) permit file systems to be expanded beyond a given unit (as is the case with SANs), (iii) provide a file system that is easy to administer and manage, (iv) permit data sharing, and (v) provide all this functionality in a way that remains efficient at very large capacities and client loads. The present invention may do so by (e.g., automatically) disseminating (e.g., state) information from a newly added unit to central administration and management operations, and by (e.g., automatically) disseminating (e.g., configuration and control) information from such operations back down to the newly added units, as well as existing units. In this way, a file system can span both local storage devices (like disk drives) and networked computational devices transparently to clients. Such state and configuration and control information can include globally managed segments as the building blocks of the file system, and a fixed mapping of globally unique file identifiers (e.g., Inode numbers) and/or ranges thereof, to such segments.

Having introduced functions that may be performed by the present invention, exemplary operations, data structures, methods and apparatus for effecting these functions are described in § 4.3 below.

§ 4.3 Exemplary Operations, Data Structures, Methods and Apparatus for Effecting Functions that May be Performed by the Present Invention

In the following, exemplary operations that may be performed by the present invention, and exemplary data structures that may be used by the present invention, are introduced in § 4.3.1 with reference to FIGS. 2-4. Then, exemplary methods for effecting such operations are described in § 4.3.2 with reference to FIGS. 5-8. Finally, exemplary apparatus that may be used to effect the exemplary processes and store the exemplary data structures are described in § 4.3.3 with reference to FIG. 9.

§ 4.3.1 Exemplary Operations and Data Structures

FIG. 2 is a process bubble diagram of operations that may be carried out by various exemplary apparatus used in the environment of FIG. 1. The apparatus include a portal 230, a file server 222, and/or a combined file server and portal 250. Each of these units may be coupled with one or more networks 210 that facilitate communications among the units. One or more file system administration units 240 may be used to gather information about units added to a file system, and disseminate system control information to all of the units (e.g., supporting portal functions) of a file system. Such information gathering and dissemination may take place over the network(s) 210, or some other network.

Referring first to the file server 222, the local file operation(s) 226 a represents the typical core functionality of a file system including reading and writing files, inserting and deleting directory entries, locking, etc. The details of the implementation of this file system are not important outside of the characteristics and behavior specified here. The local file operation(s) 226 a translates given requests into input/output (“I/O”) requests that are then submitted to a peripheral storage interface operation(s) 228 a. The peripheral storage interface operation(s) 228 a processes all the I/O requests to the local storage sub-system 229 a. The storage sub-system 229 a can be used to store data such as files. The peripheral storage interface operation(s) 228 a may be used to provide data transfer capability, error recovery and status updates. The peripheral storage interface operation(s) 228 a may involve any type of protocol for communication with the storage sub-system 229 a, such as a network protocol for example. File operation requests access the local file operation(s) 226 a, and responses to such requests are provided to the network(s) 210, via network interface operation(s) 224 a.

Referring now to the portal 230, a client (user) can access the file system of the present invention via an access point 238 a in a file system call translation operation(s). One way for this entry is through a system call, which will typically be operating system specific and file system related. The file system call translation operation(s) 232 a can be used to convert a file system request to one or more atomic file operations, where an atomic file operation accesses or modifies only one file object. Such atomic file operations may be expressed as commands contained in a transaction object. If the system call includes a file identifier (e.g., an Inode number), the file system call translation operation(s) 232 a may also be used to determine a physical part of a storage medium of the file system corresponding to the transaction (e.g., a segment number) from a (globally) unique file identifier (e.g., Inode number). The file system call translation operation(s) 232 a may include a single stage or multiple stages. The file system call translation operation(s) 232 a may also contain a local cache 233 a. This local cache 233 a may include a local data cache, a cache of file locks and other information that may be frequently needed by a client, or by a program servicing a client. If a request cannot be satisfied using the local cache 233 a, the file system call translation operation(s) 232 a may forward the transaction object containing atomic file operation commands to the transaction routing operation(s) 234 a.

The transaction routing operation(s) 234 b uses the (globally) unique file identifier (e.g., Inode number) associated with each atomic file operation command, or the physical part of the file system (e.g., the segment number) derived therefrom, to determine the location (e.g., the IP address) of a file server 222/250 that is in charge of the uniquely identified file. This file server can be local (i.e., a unit acting as both a portal and a file server, that received the request) or remote. If this file server is local, the transaction routing operation(s) 234 b simply passes the file operation to the local file operation(s) 226 b which, in turn, passes an appropriate command(s) to the peripheral storage interface operation(s) 228 b for accessing the storage medium 229 b. If, on the other hand, the file server is remote, the network(s) 210 is used to communicate this operation. The system is independent of any particular networking hardware, protocols or software. All networking requests are handed over to a network interface operation(s) 236 b.

The network interface operation(s) 224/236 services networking requests regardless of the underlying hardware or protocol, and is used to forward the transaction towards the appropriate file server 222. The network interface operation(s) 224/236 may provide data transfer, error recovery and status updates on the network(s) 210.

Referring now to FIG. 3, rather than using a disk (or some other discrete storage unit or medium) 310 as a fundamental unit of a file system, an exemplary embodiment of the present invention employs a smaller unit, referred to as a “segment” 340. A segment 340 is a contiguous range of disk (or other storage medium) memory with a predetermined maximum size (e.g., 64 gigabytes (“GB”) in one exemplary embodiment). The actual target size for a segment is configurable. In one exemplary embodiment, the target size is four (4) GB. In such an embodiment, a typical single disk drive with a capacity of, for example, 50 GB, would contain between one and a dozen segments. The actual sizes of segments can vary from disk (or other storage medium) to disk (or other storage medium).

To determine what each disk (or some other storage medium) contains, a superblock 330 is added at a fixed address. This superblock 330 contains a map of all the segments 340 residing on this disk (or some other storage medium). Such a map may list the blocks 350 where the segments start. The superblock 330 may also associate the file system(s) with the segments that belong to the file system. The superblock may be duplicated for fault-tolerance either on the same disk (or some other storage medium) or a different one.

In the file system of the present invention, a file or Inode stored on a disk (or some other storage medium) may be addressed by (i) a segment number, and (ii) a block number within the segment. The translation of this address to a physical disk address need occur only at (or by) the lowest level, by the peripheral storage interface operation(s) (e.g., thread) 228 of the appropriate file server 222/250. None of the basic file system functionality needs to know anything about which disk (or other storage medium) the segment resides on, or whether or not two segments are on the same physical hardware. That is, the client and file system calls from the client don't need to know anything about which disk (or other storage medium) a segment is on (or even the segment for that matter). Neither, in fact, do the local file operations 226 need to know anything about the disk (or other storage medium) that a given segment resides on.
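
The two-part address described above can be illustrated with a minimal sketch, assuming a superblock that records, for each segment on a disk, the block where the segment starts. The names `SegmentEntry`, `Superblock`, `physical_block` and the `block_count` field are hypothetical and are introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentEntry:
    """One row of a hypothetical superblock segment map."""
    segment_number: int
    start_block: int   # physical block on this disk where the segment starts
    block_count: int   # assumed field: length of the segment in blocks


class Superblock:
    """Illustrative stand-in for the per-disk map of segments."""

    def __init__(self, entries):
        self._by_segment = {e.segment_number: e for e in entries}

    def physical_block(self, segment_number, block_in_segment):
        """Translate (segment number, block within segment) to a disk block."""
        entry = self._by_segment.get(segment_number)
        if entry is None:
            raise LookupError(f"segment {segment_number} is not on this disk")
        if not 0 <= block_in_segment < entry.block_count:
            raise IndexError("block number lies outside the segment")
        return entry.start_block + block_in_segment


superblock = Superblock([SegmentEntry(7, 1024, 4096), SegmentEntry(8, 5120, 4096)])
print(superblock.physical_block(8, 10))  # 5130
```

Only the lowest layer would need such a translation; callers above it would continue to work with segment and block numbers alone.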

In accordance with the present invention, within a file system, each (globally) unique file identifier (“FID”) (e.g., an Inode number) is associated with a single controlling segment, though each segment can have more than one associated FID (e.g., Inode number). The FIDs (e.g., Inode numbers) can be associated with their segments in a simple fixed manner. For example, in an exemplary embodiment of the present invention, any segment has a fixed number of Inode numbers that it potentially can (i.e., may) store.

For example, for a maximum segment size of 64 GB, the fixed number of Inodes per segment may be 8,388,608 (this number comes from dividing the 64 GB maximum segment size by an average file size of 8 KB). In this exemplary embodiment, the segment number can be used to determine the actual ranges of Inode numbers controlled by a segment in the file system. For example, the first segment (number 0) of a file system would have Inode numbers 0 through 8,388,607. The second segment would have Inode numbers 8,388,608 through 16,777,215, and so on. The root Inode (directory) of a file system is assigned the number 1 by convention (Inode 0 is not used) and, of course, resides on the first segment. Note that the foregoing numbers represent the maximum ranges of Inodes that a given segment may control; the actual numbers of Inodes that have been allocated will generally be much smaller.
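
The arithmetic above can be checked with a short sketch. Python is used only for illustration, and the names `INODES_PER_SEGMENT` and `inode_range` are hypothetical; they do not appear in the disclosure.

```python
# Illustrative sketch only; the names below are assumptions.
MAX_SEGMENT_SIZE = 64 * 2**30   # 64 GB maximum segment size, in bytes
AVERAGE_FILE_SIZE = 8 * 2**10   # 8 KB average file size, in bytes

# Fixed number of Inode numbers that one segment may control.
INODES_PER_SEGMENT = MAX_SEGMENT_SIZE // AVERAGE_FILE_SIZE


def inode_range(segment_number: int) -> tuple[int, int]:
    """Return the (first, last) Inode numbers controlled by a segment."""
    first = segment_number * INODES_PER_SEGMENT
    return first, first + INODES_PER_SEGMENT - 1


print(INODES_PER_SEGMENT)  # 8388608
print(inode_range(0))      # (0, 8388607)
print(inode_range(1))      # (8388608, 16777215)
```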

An Inode in the present invention may have essentially the same properties as that of a traditional file system Inode. A number uniquely identifies the Inode, which in an exemplary embodiment is a 64-bit quantity. The Inode may contain key information about a file or directory such as type, length, access and modification times, location on disk, owner, permissions, link-count, etc. It may also contain additional information specific to the particular file system.

On disk (or other storage medium), Inodes may be maintained in Inode blocks (or groups). The Inode blocks themselves may be quite simple. In one exemplary implementation, they simply include a bitmap showing which Inodes in the block are free, a count of free Inodes, and the array of Inodes themselves, as many as fit in the block.
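
An Inode block of this kind might be modeled as follows. This is a minimal in-memory sketch, assuming a Python list stands in for the on-disk bitmap; the class name `InodeBlock` and its methods are hypothetical.

```python
class InodeBlock:
    """Illustrative Inode block: free bitmap, free count, and Inode array."""

    def __init__(self, first_inode_number, capacity):
        self.first_inode_number = first_inode_number
        self.free_bitmap = [True] * capacity  # True means the slot is free
        self.free_count = capacity
        self.inodes = [None] * capacity

    def allocate(self, inode_record):
        """Store a record in the first free slot and return its Inode number."""
        if self.free_count == 0:
            raise RuntimeError("Inode block is full")
        slot = self.free_bitmap.index(True)
        self.free_bitmap[slot] = False
        self.free_count -= 1
        self.inodes[slot] = inode_record
        return self.first_inode_number + slot

    def release(self, inode_number):
        """Mark a previously allocated Inode as free again."""
        slot = inode_number - self.first_inode_number
        if not 0 <= slot < len(self.inodes) or self.free_bitmap[slot]:
            raise ValueError("Inode is not allocated in this block")
        self.free_bitmap[slot] = True
        self.free_count += 1
        self.inodes[slot] = None
```

The free count is redundant with the bitmap; keeping both lets a caller test for a full block without scanning the bitmap.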

As noted above, each segment of the file system is responsible for a fixed set of Inode numbers. This principle is repeated within the segment. That is, segments may be of varying size, but they are always made up of some multiple of the smallest file system unit, namely the Subsegment. Within the segment, each Subsegment is again responsible for a fixed subset of the Inodes in the segment.

The data-centric nature of the file system of the present invention, and the advantages of such a data-centric file system can be appreciated from the fact that essentially every operation that can be performed on a file system is associated with some single (globally) unique FID (e.g., Inode number). In the exemplary embodiment, to determine where that file is stored, and hence where the operation needs to be performed, simply dividing the Inode number by the constant 8,388,608 yields the segment number. (If the result is not a whole number, it is truncated to the next lower whole number. For example, if the Inode number divided by the constant was 1.983, the segment number would be 1.)
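
The division with truncation described above corresponds to integer (floor) division. A minimal sketch, with the hypothetical function name `segment_for_inode`:

```python
INODES_PER_SEGMENT = 8_388_608  # system constant from the exemplary embodiment


def segment_for_inode(inode_number: int) -> int:
    """Map an Inode number to its controlling segment by integer division."""
    if inode_number < 0:
        raise ValueError("Inode numbers are non-negative")
    return inode_number // INODES_PER_SEGMENT


print(segment_for_inode(1))           # 0, the root Inode is on the first segment
print(segment_for_inode(16_634_609))  # 1, the quotient is about 1.983
```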

This convention also makes it simple to distribute the file system over multiple servers. All that is needed is a map of which segments of the file system reside on which host file server. More specifically, once the segment number is derived from the Inode number, the appropriate file server can be determined by a mapping, such as a routing table. In the simplest case, this map is simply a table that lists the file servers (on which the local agents execute) corresponding to particular segments. In one exemplary embodiment, the file server is identified by its IP address. More generally, file servers may be organized in groups, in a hierarchy, or in some other logical topology and the lookup may require communication over the network with a group leader or a node in a hierarchy. For efficiency, such information may be cached on a leased basis with registration for notification on changes to maintain coherency. The local file operation(s) 226 and peripheral storage operation(s) 228 at the determined file server can then determine the file to which an operation pertains. Once the request has been satisfied at the determined file server, the result is sent back to the original (portal) server (which may be the same as the determined file server). The original (portal) server may then return the result to the requesting client.
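
In the simplest case described above, the map can be sketched as a dictionary keyed by segment number. The table contents, the documentation-range IP addresses, and the name `server_for_inode` are all assumptions made for illustration.

```python
INODES_PER_SEGMENT = 8_388_608

# Hypothetical segment location map: segment number -> file server IP address.
# The addresses are placeholders from a documentation range, not real servers.
SEGMENT_MAP = {0: "192.0.2.10", 1: "192.0.2.10", 2: "192.0.2.11"}


def server_for_inode(inode_number, segment_map=SEGMENT_MAP):
    """Derive the segment from the Inode number, then look up its file server."""
    segment = inode_number // INODES_PER_SEGMENT
    try:
        return segment_map[segment]
    except KeyError:
        raise LookupError(f"no file server registered for segment {segment}") from None


print(server_for_inode(1))                           # 192.0.2.10
print(server_for_inode(2 * INODES_PER_SEGMENT + 5))  # 192.0.2.11
```

The size of such a map grows with the number of segments rather than with the number of files.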

In one exemplary embodiment of the present invention, each (globally) unique FID (e.g., Inode) resides in a segment referred to as the “controlling segment” for that FID (e.g., Inode). As is understood in the art, an Inode is associated with each file and encloses key information about the file (e.g., owner, permissions, length, type, access and modification times, location on disk, link count, etc.), but not the actual data. In the exemplary embodiment of the present invention, the data associated with an Inode may actually reside on another segment (i.e., outside the controlling segment of the Inode). However, the controlling segment of a particular Inode, and the segment(s) containing the data associated with the particular Inode, will be addressable and accessible by the controlling file server. A group of segments that is addressable and accessible by a given file server is referred to as a “maximal segment group”. Thus, the Inode and its associated data (e.g., the contents of the file) are contained within a maximal segment group.

At any given time, a segment is under the control of at most one local agent (i.e., residing on the local file server). That agent is responsible for carrying out file system operations for any FID controlled by that segment. The controlling segment's unique identifier (“SID”) for each FID is computable from the FID by the translator using information available locally (e.g., in the superblock 330). In the foregoing exemplary embodiment, the controlling SID may be computed simply via integer division of the FID by a system constant, which implies a fixed maximum number of files controlled per segment. Other algorithms may be used.

Data from a file may be contained in a segment in the maximal segment group which is not under the control of the file server responsible for the controlling segment. In this case, adding space to or deleting space from the file in that segment may be coordinated with the file server responsible for it. No coordination is necessary for simple read accesses to the blocks of the file.

Client (user) entry and access to the entire file system may thus occur through any unit that has translation and routing operations, and that has access to a segment location map. Such units may be referred to as “portals.” Multiple simultaneous access points into the system are a normal configuration of the file system. Note that a portal unit will not need a file system call translator operation(s) 232, assuming that such operations are provided on the client (end user) machines. However, such a configuration will require software installation and maintenance on a potentially large number of machines.

§ 4.3.2 Exemplary Methods

Exemplary methods that may be used to effect some of the operations introduced in § 4.3.1 above are now described.

FIG. 5 is a flow diagram of an exemplary method 232 b′ that may be used to effect a file system call translation operation 232 b. A file system call is accepted, as indicated by block 510. It is assumed that the file system call includes some type of globally unique file identifier (“FID”), such as an Inode number for example. Note that such a globally unique identifier will typically not be included when a file (or other component such as a directory or folder) is first provided (e.g., written) to the file system. As shown by conditional branch point 515 and block 525, if this is the case, a globally unique identifier (e.g., an Inode number) is assigned. Such assignment may be based on policies and/or global file system state information. Next, as shown in block 520, the relevant segment number is determined based on the unique FID (e.g., Inode number) of the file to which the file system call pertains. Recall that this may be done by dividing an Inode number by some fixed number (and truncating to the next lower whole number if a remainder exists) in one embodiment. Then, a file system transaction is generated based on the file system call, as indicated by block 530. That is, a file system call from a client may have a particular format or syntax. If necessary, information from this file system call is simply reformatted into the appropriate syntax used in the distributed file system. This syntax may be a transaction object containing one or more so-called atomic file operation commands.

At conditional branch point 540, it is determined whether or not the transaction (or parts thereof) can be completed using the local cache (assuming that such a local cache is provided). If so, the transaction (or parts thereof) is completed locally, as indicated by block 550, and the method 232 b′ is left via RETURN node 570. Otherwise, the transaction (or parts thereof) is forwarded to a routing operation as indicated by block 560, and the method 232 b′ is left via RETURN node 570.
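
The flow of FIG. 5 might be sketched as follows. This is a simplified illustration under several assumptions: a file system call is modeled as a dictionary with an `op` and an optional `fid`, the local cache is a dictionary keyed by `(op, fid)`, and `assign_fid` and `forward` are hypothetical callbacks standing in for identifier assignment and the routing operation. None of these names or shapes come from the disclosure.

```python
INODES_PER_SEGMENT = 8_388_608


def translate_call(call, cache, assign_fid, forward):
    """Sketch of the translation method; comments cite the blocks of FIG. 5."""
    fid = call.get("fid")
    if fid is None:                        # branch point 515
        fid = assign_fid()                 # block 525: assign a unique identifier
    segment = fid // INODES_PER_SEGMENT    # block 520: determine the segment
    transaction = {                        # block 530: build the transaction
        "segment": segment,
        "commands": [{"op": call["op"], "fid": fid}],
    }
    key = (call["op"], fid)
    if key in cache:                       # branch point 540
        return cache[key]                  # block 550: complete locally
    return forward(transaction)            # block 560: hand off to routing
```

A real cache would also have to handle invalidation and file locks, which this sketch leaves out.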

FIG. 6 is a flow diagram of an exemplary method 234′ that may be used to effect a transaction routing operation 234. As indicated by block 610, the segment number is used to determine (e.g., is mapped to) a server identifier or address (or at least to another machine that can map the segment to a server identifier or address). The server identifier or address may be an Internet protocol (“IP”) address. For example, in the exemplary data structure 235′ of FIG. 4 if the segment number (or a part thereof not masked out by a mask 414) matches a stored segment number 422, or falls within a range of segment numbers 412, the appropriate file server location, or partial file server location, 416 can be determined. Such a table may be manually or automatically populated (e.g., using file system administration 240) in a variety of ways, many of which will be apparent to those skilled in the art. For example, segment number-file server (address) associations can be manually tracked, and provisioned manually, by some global (i.e., file system wide) administrative authority. Each portal could then be manually configured using information from the administrative authority. On the other end of the spectrum, some automated signaling and network state distribution protocols, such as those commonly used by routers for example, may be used to collect file server information, provision segment numbers to that file server, and distribute segment number-file server associations to all portal units.
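
One possible reading of the table lookup described above is sketched below. The text does not fix the exact layout of the table of FIG. 4, so the entry fields (`low`, `high`, `mask`, `location`), the first-match rule, and the example addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableEntry:
    low: int                    # lowest segment number covered by the entry
    high: int                   # highest covered (equal to low for one segment)
    location: str               # file server location, e.g., an IP address
    mask: Optional[int] = None  # if set, applied to the segment number first


def lookup(table, segment_number):
    """Return the location of the first matching entry, or None if no match."""
    for entry in table:
        candidate = segment_number
        if entry.mask is not None:
            candidate = segment_number & entry.mask
        if entry.low <= candidate <= entry.high:
            return entry.location
    return None


table = [
    TableEntry(0, 3, "192.0.2.10"),                   # range of segment numbers
    TableEntry(4, 4, "192.0.2.11"),                   # single segment number
    TableEntry(0x10, 0x10, "192.0.2.12", mask=0xF0),  # masked match
]
print(lookup(table, 2))     # 192.0.2.10
print(lookup(table, 0x17))  # 192.0.2.12
```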

Referring back to FIG. 6, at conditional branch point 620, it is determined whether or not the portal/server is the same file server as that identified. That is, whether or not the transaction is to be performed locally is determined. This can only be the case when portal and file server functionality is provided on a machine with a single address for such purposes. (Recall, e.g., the file server and portal 250 of FIG. 2.) If so, the transaction is passed to the local peripheral storage interface operation(s) via the local file operation(s), as indicated by block 630. (Recall, e.g., operations 226 b and 228 b of FIG. 2.) The method 234′ is then left via RETURN node 650.

Referring back to conditional branch point 620, if it is determined that the file server identified differs from the portal machine, the transaction is passed to network interface operation(s) 640, before the method 234′ is left via RETURN node 650.
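The decision at conditional branch point 620 can be pictured with a minimal sketch such as the one below. It assumes that addresses are compared as strings and that the machine has a single address for this purpose; the names Route and route_transaction are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome of branch point 620. */
typedef enum {
    ROUTE_LOCAL,    /* block 630: hand to the local file operation(s) */
    ROUTE_NETWORK   /* block 640: hand to the network interface operation(s) */
} Route;

/* Compare this machine's own address with the address found for the segment. */
static Route route_transaction(const char *self_addr, const char *server_addr)
{
    assert(self_addr != NULL && server_addr != NULL);
    return strcmp(self_addr, server_addr) == 0 ? ROUTE_LOCAL : ROUTE_NETWORK;
}
```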

FIG. 7 is a flow diagram of an exemplary method 236′ that may be used to effect a network interface operation 236. Upon receipt of a transaction, the transaction is “packaged” for forwarding towards the appropriate file server, as indicated by block 710. For example, if the appropriate file server has an IP address, the transaction may be carried as data in an IP packet. The packaged transaction is then forwarded towards the appropriate file server based on the file server address information, as indicated by block 720. The method 236′ may then be left via RETURN node 730. A complementary method 224′, not shown, can be used to unpackage a transaction (and save the address of the portal server) when it reaches the appropriate file server.
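As a rough sketch of what “packaging” and the complementary unpackaging could look like, the fragment below serializes a three-field transaction into a byte buffer in big-endian order. The WireXact layout, the field choice, and the function names are assumptions made for illustration; the actual wire format is not given in this description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire-level view of a transaction: only three fields. */
typedef struct {
    uint32_t opcode;   /* which file operation is requested */
    uint32_t iseg;     /* segment of the sought block */
    uint32_t address;  /* address of the sought block */
} WireXact;

enum { WIRE_XACT_SIZE = 12 };

static void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
}

static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Package (block 710): returns bytes written, or 0 if the buffer is too small. */
static size_t package_xact(const WireXact *x, uint8_t *buf, size_t cap)
{
    assert(x != NULL && buf != NULL);
    if (cap < WIRE_XACT_SIZE)
        return 0;
    put_u32(buf, x->opcode);
    put_u32(buf + 4, x->iseg);
    put_u32(buf + 8, x->address);
    return WIRE_XACT_SIZE;
}

/* Unpackage at the file server: returns 0 on success, -1 on a short buffer. */
static int unpackage_xact(const uint8_t *buf, size_t len, WireXact *out)
{
    assert(buf != NULL && out != NULL);
    if (len < WIRE_XACT_SIZE)
        return -1;
    out->opcode = get_u32(buf);
    out->iseg = get_u32(buf + 4);
    out->address = get_u32(buf + 8);
    return 0;
}
```

The resulting buffer could then be carried as the payload of an IP packet, as the paragraph above suggests.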

FIG. 8 is a flow diagram of an exemplary method 226′ that may be used to effect local file operations 226. First, as indicated by block 810, the request is translated into input/output requests. These requests are then submitted to the peripheral storage operation(s) 820. The method 226′ is then left via RETURN node 830.

Having described various exemplary methods that may be used to effect various operations, exemplary apparatus for effecting at least some of such operations are described in § 4.3.3 below.

§ 4.3.3 Exemplary Apparatus

FIG. 9 is a high-level block diagram of a machine (e.g., a computer, a personal computer, a hand-held computer, a network server, etc.) 900 that may effect one or more of the operations discussed above. The machine 900 basically includes a processor(s) (e.g., microprocessors, ASICs, etc.) 910, an input/output interface unit(s) 930, a storage device(s) (e.g., RAM, ROM, disk-based storage, etc.) 920, and a system bus or network 940 for facilitating the communication of information among the coupled elements. An input device(s) 932 and an output device(s) 934 may be coupled with the input/output interface(s) 930.

The processor(s) 910 may execute machine-executable instructions to effect one or more aspects of the present invention. At least a portion of the machine-executable instructions may be stored (temporarily or more permanently) on the storage device(s) 920 and/or may be received from an external source via an input interface unit 930.

§ 4.4 Examples of Operations of Exemplary Embodiment

In an exemplary embodiment of the present invention, every basic file system function, whether client-oriented (e.g., read, write, etc.) or system-oriented (e.g., format disk, create file system, de-fragment disk, etc.), is viewed as a simple transaction object containing (atomic file operation) command substructures with slots for input parameters and results. The thread that generates the transaction will know how to set or read these input/output slots.

In the exemplary embodiment, each transaction type can be thought of as having two functions associated with it: a processing function and a packaging function. The processing function has two modes: a query mode and a normal mode. In the query mode, the function simply provides the caller (the main thread) with the file system and controlling Inode number of a specific transaction to be used to determine where the transaction must be processed. In the normal mode, the function performs whatever work is necessary to satisfy the file-system function. The packaging function handles packaging or un-packaging the input or output data of the transaction for transport between (portal and file server) hosts.
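One way to picture the pairing of a processing function (with its query and normal modes) and a packaging function is a per-type table of function pointers, as in the sketch below. The structure names, the fields, and the stand-in read type are hypothetical and are not taken from the exemplary embodiment; the packaging routine copies fields in host byte order purely to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { MODE_QUERY, MODE_NORMAL } XactMode;

typedef struct Xact Xact;

/* Per-type function pair: one processing function, one packaging function. */
typedef struct {
    void   (*process)(Xact *x, XactMode mode);
    size_t (*package)(const Xact *x, unsigned char *buf, size_t cap);
} XactType;

struct Xact {
    const XactType *type;
    int  fs_id;        /* input: file system the call refers to */
    long inum;         /* input: controlling Inode number */
    int  route_fs;     /* result reported by query mode */
    long route_inum;   /* result reported by query mode */
    int  done;         /* set by normal mode */
};

static void read_process(Xact *x, XactMode mode)
{
    assert(x != NULL);
    if (mode == MODE_QUERY) {
        /* Only report what the router needs; do no file work. */
        x->route_fs = x->fs_id;
        x->route_inum = x->inum;
    } else {
        /* Stand-in for the real work of satisfying the request. */
        x->done = 1;
    }
}

static size_t read_package(const Xact *x, unsigned char *buf, size_t cap)
{
    assert(x != NULL && buf != NULL);
    size_t need = sizeof x->fs_id + sizeof x->inum;
    if (cap < need)
        return 0;
    memcpy(buf, &x->fs_id, sizeof x->fs_id);
    memcpy(buf + sizeof x->fs_id, &x->inum, sizeof x->inum);
    return need;
}

static const XactType READ_TYPE = { read_process, read_package };
```

A routing thread would call `x->type->process(x, MODE_QUERY)` first, and only the machine that owns the segment would later call the same function in `MODE_NORMAL`.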

In addition, in the exemplary embodiment, each transaction has function-specific state variables used by the processing function. Each processing function is written to carefully track progress in executing the file system request so that at any point it may pass control over the transaction object to another process while awaiting a required resource, and then resume execution where it stopped when the resource becomes available. In effect, then, transactions are implemented as atomic file operations. These atomic file operations “block” individually, so that the threads themselves never have to.

To better understand how to read or write pages on disk (or some other storage medium), examples of operations of an exemplary embodiment of the present invention are now described. More specifically, an example of a file read is described in § 4.4.1 with reference to FIG. 10. Then, an example of a file write is described in § 4.4.2 with reference to FIG. 11.

In both cases, it must be understood that whenever a transaction needs to wait for a resource, such as a file for example (e.g., because it needs to be read from the disk, or because another transaction has it locked), the transaction may be queued while it waits for the resource to become available. In one embodiment, within the transaction (processing routine) itself, a transaction pointer is set to NULL. Whether or not the transaction pointer is valid (not NULL and with no error code) may be checked constantly. In addition, state variables within the transaction command structure may be maintained so that when the resource becomes available and the transaction is placed back into the execution queue, the transaction starts at the appropriate place (e.g., at the appropriate atomic file operation).
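The queue-and-resume behavior just described can be sketched, under simplifying assumptions, as a routine that returns NULL whenever it must wait and that uses state variables to skip stages already completed on re-entry. The DemoXact and DemoResources types and the two-resource model below are hypothetical stand-ins, not part of the exemplary embodiment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transaction with state variables that survive re-entry. */
typedef struct {
    int have_page;  /* stage 1 finished */
    int have_lock;  /* stage 2 finished */
    int done;       /* stage 3 finished */
    int entries;    /* number of times the routine was entered */
} DemoXact;

/* Hypothetical resources the transaction may have to wait for. */
typedef struct {
    int page_loaded;
    int lock_free;
} DemoResources;

/* Returns the transaction if it ran to completion, or NULL if it was
 * (conceptually) queued on a resource and must be re-entered later. */
static DemoXact *demo_process(DemoXact *xact, DemoResources *res)
{
    assert(xact != NULL && res != NULL);
    DemoXact *self = xact;
    self->entries++;

    if (xact != NULL && !self->have_page) {
        if (res->page_loaded)
            self->have_page = 1;
        else
            xact = NULL;            /* wait for the page */
    }
    if (xact != NULL && !self->have_lock) {
        if (res->lock_free) {
            res->lock_free = 0;
            self->have_lock = 1;
        } else {
            xact = NULL;            /* wait for the lock */
        }
    }
    if (xact != NULL) {
        res->lock_free = 1;         /* release and finish */
        self->done = 1;
    }
    return xact;
}
```

Because every stage is guarded by the validity of the pointer, the calling thread never blocks; it simply receives NULL and moves on to the next transaction in its queue.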

The description below follows the logic of the following code fragment. The command pointer is a pointer to some typical transaction structure (referred to as an Ibrix transaction structure without loss of generality) with the following members (perhaps among others). Assume that all members have been initialized to zero or NULL.

    typedef struct {
        /* Inputs */
        File_System *fs;   /* Current file system */
        int iseg;          /* Segment of sought block */
        int address;       /* Address of sought block */
        /* Workspace variables */
        Cpage *page;
        int have_lock;
    } Typical_Cmd;

    #define NORMAL_PAGE 0

In each of the following examples provided in §§ 4.4.1 and 4.4.2, it is assumed that the relevant file (e.g., the relevant Inode and the data associated with the relevant Inode) is stored on, or is to be stored on, a (file server) unit other than the (portal) unit receiving the file call.

§ 4.4.1 Example of a Read Operation

FIG. 10 illustrates communications between operations in an exemplary sequence for executing a file system read request. A client (user) process (not shown) issues a file system call 1005 which is accepted by the file system call translation operation(s) (e.g., thread) 232. It 232 translates the client's file system call to a file system call having a transaction object-command syntax. More specifically, the request enters a file system call translation operation(s) (e.g., thread) 232. The operation(s) 232 allocates a transaction structure, fills in the operating system specific command substructure with the input parameters, and forwards the transaction to the transaction routing operation(s) 234 (e.g., places the transaction in the transaction router operation(s) thread input queue) as indicated by communication 1010.

The transaction routing operation(s) (e.g., thread) 234 calls the appropriate processing routine in query mode, obtaining the controlling Inode number of the request and the file system. It 234 computes the segment on which the controlling Inode resides. Then as indicated by communications 1015 and 1020, using the segment number, and the segment to file server address map 235, it 234 determines a server address. As stated above, in this example, the file server is remote. Since the segment is determined to be on another (file server) unit, the transaction routing operation(s) (e.g., thread) 234 marks the destination in the transaction and forwards the transaction to the network interface operation(s) 236 (e.g., puts the transaction on the input queue of the network interface operation(s) thread) as indicated by communication 1025.
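The paragraph above does not say how the segment is computed from the controlling Inode number. One option, recited in the claims, is to divide the file identifier by a predetermined number. The sketch below assumes that option; the constant INODES_PER_SEGMENT and its value are purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed, illustrative constant: how many Inode numbers map to one segment. */
#define INODES_PER_SEGMENT 1000000u

/* Segment on which the Inode with number inum resides. */
static uint32_t segment_of_inode(uint64_t inum)
{
    assert(inum / INODES_PER_SEGMENT <= UINT32_MAX);
    return (uint32_t)(inum / INODES_PER_SEGMENT);
}

/* First Inode number of a segment, i.e. the start of its fixed range. */
static uint64_t first_inode_of_segment(uint32_t seg)
{
    return (uint64_t)seg * INODES_PER_SEGMENT;
}
```

Because the mapping is fixed, any portal can compute the segment locally, without consulting another machine, and then look up the server address for that segment.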

The network interface operation(s) (e.g., thread) 236 calls the packaging routine on the transaction and forwards the packaged transaction to the appropriate (file server) unit as indicated by communication 1030. At the appropriate (file server) unit, the network interface operation(s) (e.g., thread) 224 calls the packaging routine to un-package the transaction and passes the transaction on to its local file system operation(s) (e.g., thread) 226 as indicated by communication 1035.

When the local file system operation(s) (e.g., thread) 226 determines that a read transaction is to be processed on the current machine, possibly after it has received the read transaction from another machine via the network interface operation(s) (e.g., thread) 224, it 226 then uses the normal mode processing routine to satisfy the file system function. This may actually involve multiple cycles through the processing function as the read transaction must typically wait for various resources to become available at different points in the function. (See, e.g., communications 1040 and 1045.) As described below, the read transaction, performed at the local file operation(s) 226, may include pin, lock, and read&release commands.

The following illustrates three stages of an exemplary read operation:

    Xaction *typical_read (Xaction *xact, Inum *inum) {
        . . .
        /* 1. Pin the page */
        if (X_VALID(xact)) {
            Segment *segment = cmd->fs->segments[cmd->iseg];
            int n_pages = 1;
            cmd->page = pin_or_load_pages(&xact, cmd->address,
                                          NORMAL_PAGE, n_pages, segment);
            /* NOTE: xact may now be NULL! */
        }
        /* 2. Lock for reading */
        if (X_VALID(xact) && cmd->have_lock == 0) {
            if (set_read_lock(cmd->page, PAGE_INDEX(cmd->address), &xact)) {
                cmd->have_lock = 1;
            }
            /* NOTE: xact may be NULL here! Note that in version
             * 1.0, set_read_lock takes a pointer to the xact,
             * not its address. In that case, we must have an
             * else clause which explicitly sets xact to NULL
             */
        }
        /* 3. Read & release */
        if (X_VALID(xact)) {
            char *buf = cmd->page->pages[PAGE_INDEX(cmd->address)];
            . . . /* Read the buffer */
            unset_read_lock(cmd->page, PAGE_INDEX(cmd->address));
            unpin_cache_page(cmd->page, PAGE_INDEX(cmd->address));
            xact->info |= IB_DONE;
        }
        return xact;
    }

The first stage of the read command, indicated by:

    /* 1. Pin the page */

loads the page into the cache from the disk and pins it for use. This first stage is quite simple, but it is also the most frequent type of action to take in any transaction in the file system of the present invention. Whether a page is to be read or modified, the system must first get it. A routine checks whether the page already exists in the local cache. If so, it attempts to pin it on behalf of the calling transaction. If the page is not available from the local cache, the routine generates a request to load the page from the disk and places the request on the input queue of a peripheral storage interface operation(s) thread. The transaction pointer is also recorded in the load request so that the thread may place it in the wait queue of the page once it has created it. Once recorded, the pointer is set to NULL. Note that the pointer may also become NULL if the page existed in the local cache, but the pin failed. These two events cannot be distinguished from the calling process.

Assuming that the first time through, the page was not available in the local cache, the transaction will be placed back in the local file system operation(s) thread queue once the page has been loaded. Note that the same instructions as before will be re-executed, but this time they will succeed.

In the second stage of the read command, indicated by:

    /* 2. Lock for reading */

the page is locked so that the contents can be read without any danger of another thread modifying the page during such a read. The function that sets the lock performs in the same manner as the pin function.

An additional state variable (cmd->have_lock) is introduced. This state variable is not absolutely necessary in the example routine as written here, since there are no subsequent places in the routine where the transaction will have to wait on a queue. However, in general, it may be necessary to introduce some state variable to ensure that the same lock is not retried on a subsequent entry into the routine on the same transaction.

Once the page is locked by the transaction, in a third stage of the read command, indicated by:

    /* 3. Read & release */

the page is read. Once done with the read, the transaction will release the lock and unpin the page. Note that, if further use of the same page is anticipated, the transaction might unset the read lock, but not unpin. It is then important to ensure that when the transaction is done, it will then unpin the page.

Once done (See, e.g., communications 1050 and 1055), the read transaction (i.e., the file, etc.) is passed back to its source. The read transaction may go directly to the file system call translation operation(s) (e.g., thread) 232, and thence to the client (user) that made the original file system call. Alternatively, the transaction may pass through the network interface operations (e.g., threads) 224 and 246 to be passed back to the original (portal) unit, and thence to the file system call translation operation(s) (e.g., thread) 232 there (as indicated by communications 1060, 1065 and 1070), and then to the client (user) that made the original file system call.

§ 4.4.2 Example of a Write Operation

FIG. 11 illustrates communications between operations in an exemplary sequence for executing a file system write request. A client (user) process (not shown) issues a file system call 1105 which is accepted by the file system call translation operation(s) (e.g., thread) 232. It 232 translates the client's file system call to a file system call having a transaction object-command syntax. More specifically, the request enters a file system call translation operation(s) (e.g., thread) 232. The operation(s) 232 allocates a transaction structure, fills in the operating system specific command substructure with the input parameters, and forwards the transaction to the transaction routing operation(s) 234 (e.g., places the transaction in the transaction router operation(s) thread input queue) as indicated by communication 1110. If the file hasn't yet been written to the file system, the file system call translation operation(s) 232 may assign a globally unique file identifier (FID) (e.g., an Inode number). Such FID (Inode number) assignment may be based on policies and/or a global state of the file system.

The transaction routing operation(s) (e.g., thread) 234 calls the appropriate processing routine in query mode, obtaining the controlling Inode number of the request and the file system. It 234 computes the segment on which the controlling Inode is to reside. Then as indicated by communications 1115 and 1120, using the segment number, and the segment to file server address map 235, it 234 determines a server address. As stated above, in this example, the file server is remote. Since the segment is determined to be on another (file server) unit, the transaction routing operation(s) (e.g., thread) 234 marks the destination in the transaction and forwards the transaction to the network interface operation(s) 236 (e.g., puts the transaction on the input queue of the network interface operation(s) thread) as indicated by communication 1125.

The network interface operation(s) (e.g., thread) 236 calls the packaging routine on the transaction and forwards the packaged transaction to the appropriate (file server) unit as indicated by communication 1130. At the appropriate (file server) unit, the network interface operation(s) (e.g., thread) 224 calls the packaging routine to un-package the transaction and passes the transaction on to its local file system operation(s) (e.g., thread) 226 as indicated by communication 1135.

When the local file system operation(s) (e.g., thread) 226 determines that a write transaction is to be processed on the current machine, possibly after it has received the write transaction from another machine via the network interface operation(s) (e.g., thread) 224, it 226 then uses the normal mode processing routine to satisfy the file system function. This may actually involve multiple cycles through the processing function as the write transaction must typically wait for various resources to become available at different points in the function. (See, e.g., communications 1140 and 1145.) As described below, the write transaction, performed at the local file operation(s) 226, may include pin, lock, and write&dirty commands.

The first two stages of modifying an existing disk block (as opposed to allocating a new block to write to) are essentially identical to the first two stages of the read transaction described in § 4.4.1 above, except that the lock request is set_write_lock rather than set_read_lock. Only the code beginning at stage 3 is shown.

    Xaction *typical_write (Xaction *xact, Inum *inum) {
        . . .
        /* 1. Pin the page - as in read */
        /* 2. Lock for writing - analogous to read */
        /* 3. Write & dirty */
        if (X_VALID(xact) && cmd->did_write == 0) {
            char *buf = cmd->page->pages[PAGE_INDEX(cmd->address)];
            . . . /* Make changes to the buffer */
            unset_write_lock(cmd->page, PAGE_INDEX(cmd->address),
                             (IBC_PAGE_DIRTY|IBC_PAGE_LOCK_FLUSH));
            cmd->did_write = 1;
            wait_on_page_queue(cmd->page, &xact);
            /* NOTE: xact is now NULL! */
        }
        if (X_VALID(xact) && cmd->did_write) {
            int iw = PAGE_INDEX(cmd->address);
            if (cmd->page->info[iw] & IBC_PAGE_READY) {
                /* We are DONE! */
                unset_flush_lock(cmd->page, iw);
                unpin_cache_page(cmd->page, iw);
                xact->info |= IB_DONE;
            }
            else {
                wait_on_page_queue(cmd->page, &xact);
            }
        }
        return xact;
    }

The difference from reading occurs at the point when the transaction unlocks the page. Unlike reading, writing changes the contents of the page. Thus, when the transaction unlocks the page, the cache is informed that the transaction modified the page. This may be done by passing the IBC_PAGE_DIRTY flag. Setting this flag on the page will cause it to be placed in the dirty-page queue to be written to disk the next time the cache thread executes.

If it is desired to confirm that a write of the new data has actually occurred, along with the IBC_PAGE_DIRTY flag, the transaction may also set a flush lock. (See, e.g., communications 1160, 1165 and 1170.) Typically, the page cannot actually be written until the transaction exits and the cache thread executes, so the transaction explicitly places itself on the page wait queue.

Once the write has occurred, the transaction will be placed back on the local file system operation(s) thread's input queue and it will reenter this routine. The transaction can verify that it is indeed here because the write completed (by checking the IBC_PAGE_READY flag). If not, the transaction can re-insert itself on the page queue. If so, the transaction can unset the flush lock, unpin the page and exit.
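The interplay of the dirty flag, the flush lock, and the ready flag can be illustrated with the simplified model below. The DEMO_ flag values, the DemoPage type, and the single-page "cache thread" function are assumptions made for this sketch and do not reproduce the actual cache implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real values are not given in the description. */
enum {
    DEMO_PAGE_DIRTY      = 1 << 0,
    DEMO_PAGE_LOCK_FLUSH = 1 << 1,
    DEMO_PAGE_READY      = 1 << 2
};

typedef struct {
    unsigned info;        /* flag word for the page */
    int writes_to_disk;   /* how many times the page was flushed */
} DemoPage;

/* The transaction releases its write lock and reports what it did. */
static void demo_unset_write_lock(DemoPage *p, unsigned flags)
{
    assert(p != NULL);
    p->info &= ~(unsigned)DEMO_PAGE_READY;
    p->info |= flags;
}

/* One pass of the cache thread: write dirty pages and mark them ready. */
static void demo_cache_flush(DemoPage *p)
{
    assert(p != NULL);
    if (p->info & DEMO_PAGE_DIRTY) {
        p->writes_to_disk++;
        p->info &= ~(unsigned)DEMO_PAGE_DIRTY;
        p->info |= DEMO_PAGE_READY;
    }
}

/* On re-entry: returns 1 and drops the flush lock if the write completed,
 * or 0 if the transaction should put itself back on the page queue. */
static int demo_write_confirmed(DemoPage *p)
{
    assert(p != NULL);
    if (!(p->info & DEMO_PAGE_READY))
        return 0;
    p->info &= ~(unsigned)DEMO_PAGE_LOCK_FLUSH;
    return 1;
}
```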

Note that if verifying the write is not necessary or not desired, then the third stage of the transaction could have done the following:

    /* 3. Write & dirty */
    if (X_VALID(xact) && cmd->did_write == 0) {
        char *buf = cmd->page->pages[PAGE_INDEX(cmd->address)];
        . . . /* Make changes to the buffer */
        unset_write_lock(cmd->page, PAGE_INDEX(cmd->address),
                         IBC_PAGE_DIRTY);
        unpin_cache_page(cmd->page, PAGE_INDEX(cmd->address));
        xact->info |= IB_DONE;
    }

As before, the cache will process the write, but it is not confirmed.

In one exemplary embodiment of the present invention, the maximum length of a file name is 8191 characters (i.e., one file system block). Within the directory structure itself, however, only 42 (constant MAX_FAST_NAME_SIZE) characters may be recorded. If the name exceeds this size, it is replaced by a 16-byte number computed by a message digest algorithm (the MD5 algorithm; see, e.g., RFC 1321, which is incorporated herein by reference) for lookup/comparison purposes, plus a pointer to a block containing the name itself.
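A directory entry of this kind might be laid out roughly as in the following sketch. The DemoDirEntry layout, the use of a block number as the "pointer", and the function names are hypothetical. The digest routine is passed in by the caller; the demo_digest function included here is a toy XOR fold that exists only so the sketch is self-contained, and it is NOT MD5. A real implementation would supply an MD5 routine per RFC 1321.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_FAST_NAME_SIZE 42
#define DIGEST_SIZE 16

/* Caller-supplied digest routine (MD5 in the exemplary embodiment). */
typedef void (*DigestFn)(const char *data, size_t len,
                         unsigned char out[DIGEST_SIZE]);

/* Hypothetical directory entry: a short name is stored inline; a long name
 * is represented by its digest plus the block that holds the full name. */
typedef struct {
    int is_long;
    size_t name_len;
    union {
        char fast_name[MAX_FAST_NAME_SIZE];  /* not NUL-terminated when full */
        struct {
            unsigned char digest[DIGEST_SIZE];
            uint32_t name_block;             /* assumed: block number of name */
        } lng;
    } u;
} DemoDirEntry;

/* Toy stand-in digest (XOR fold). NOT MD5; for demonstration only. */
static void demo_digest(const char *data, size_t len,
                        unsigned char out[DIGEST_SIZE])
{
    memset(out, 0, DIGEST_SIZE);
    for (size_t i = 0; i < len; i++)
        out[i % DIGEST_SIZE] ^= (unsigned char)data[i];
}

static void dir_entry_set_name(DemoDirEntry *e, const char *name,
                               DigestFn digest, uint32_t name_block)
{
    assert(e != NULL && name != NULL && digest != NULL);
    size_t len = strlen(name);
    memset(e, 0, sizeof *e);
    e->name_len = len;
    if (len <= MAX_FAST_NAME_SIZE) {
        memcpy(e->u.fast_name, name, len);
    } else {
        e->is_long = 1;
        digest(name, len, e->u.lng.digest);
        e->u.lng.name_block = name_block;
    }
}
```

Lookups of long names would then compare the 16-byte digests first and read the name block only when the digests match.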

§ 4.5 CONCLUSIONS

As can be appreciated from the foregoing, the present invention teaches a file system that can span a disparate mix of heterogeneous units such as portals, file servers, and combinations thereof. These units are connected over one or more networks and are managed and administered on a global basis. Consequently, a system implemented in accordance with the present invention allows the transparent addition of any resources into the overall system without the need for planning or downtime.

As far as a client (user) is concerned, the entire file system resides on a portal unit. As long as the protocols used by the client employ file-locking procedures, any or all servers participating in a file system of the present invention may act as portal machines.

The Inode mapping convention allows a distributed file system to be realized.

1. For use with a distributed file system in which files are distributed across more than one file server, each file server having physical storage media, a method for determining a particular file server to which a file system call pertains, the method comprising: a) accepting a file system call including a file identifier; b) determining a contiguous unit of the physical storage media of the file servers of the distributed file system based on the file identifier; c) determining the file server having the physical storage media that contains the determined contiguous unit; and d) forwarding a request, based on the file system call accepted, to the file server determined to have the physical storage media that contains the determined contiguous unit.
2. The method of claim 1 wherein the file identifier is an Inode number.
3. The method of claim 1 wherein the contiguous unit is a segment.
4. The method of claim 1 wherein the file identifier is a number, wherein the contiguous unit is a segment, and wherein the segment is determined by dividing the file identifier number by a predetermined number. Can this be generalized? All that is required is an algorithm which generates a unique mapping of a range of inode numbers to an identifying segment number. Division by a constant is only one such method.
5. The method of claim 1 wherein the file server having the physical storage media that contains the determined contiguous unit is determined by a table, administered globally across the file system, that maps the contiguous unit to the file server.
6. The method of claim 5 wherein the table maps the contiguous unit to an address of the file server.
7. The method of claim 6 wherein the address is an Internet protocol address.
8. For use with a distributed file system in which files are distributed across more than one file server, each file server having physical storage media, a machine readable medium having stored thereon a data structure, the data structure comprising: a) a first field for storing an identifier of a contiguous unit of the physical storage media of the file servers of the distributed file system; and b) a second field for storing an identifier of a file server having the physical storage media that contains the contiguous unit identified by the first field.
9. The data structure of claim 8 wherein the identifier of a contiguous unit is a segment number.
10. The data structure of claim 8 wherein the identifier of a file server is an address of the file server.
11. The data structure of claim 10 wherein the address is an Internet protocol address.
12. The data structure of claim 8 wherein instances of the data structure are stored on machine readable media on each of a number of servers used to access the distributed file system.
13. For use with a distributed file system in which files are distributed across more than one file server, each file server having physical storage media, a server for providing an access point to the distributed file system, the server comprising: a) an input for accepting a file system call including a file identifier; b) a translator for determining a contiguous unit of the physical storage media of the file servers of the distributed file system based on the file identifier; c) a router for determining the file server having the physical storage media that contains the determined contiguous unit; and d) a network interface for forwarding a request, based on the file system call accepted, to the file server determined to have the physical storage media that contains the determined contiguous unit.
14. The server of claim 13 wherein the file identifier is an Inode number.
15. The server of claim 13 wherein the contiguous unit is a segment.
16. The server of claim 13 wherein the file identifier is a number, wherein the contiguous unit is a segment, and wherein the translator determines the segment by dividing the file identifier number by a predetermined number.
17. The server of claim 13 further comprising a table, used by the translator to determine the file server having the physical storage media that contains the determined contiguous unit.
18. The server of claim 17 wherein the table maps the contiguous unit to an address of the file server.
19. The server of claim 18 wherein the address is an Internet protocol address.
20. The server of claim 17 wherein the table is administered globally across the file system.
21. The server of claim 17 wherein the table maps the contiguous unit to the file server.
22. The server of claim 13 further comprising: e) a local cache; and f) means for determining whether or not the file system call can be satisfied using the local cache.
23. The server of claim 13 wherein the file identifier is a number, wherein the contiguous unit is a segment, and wherein the translator determines the segment by an algorithm that yields a unique integer for a predetermined range of file identifier numbers.